To optimize the dataframe's memory usage, we downcast the numeric columns to the smallest subtype that can hold them:
df_int = df.select_dtypes(include=['int64'])
converted_int = df_int.apply(pd.to_numeric, downcast='unsigned')
df_float = df.select_dtypes(include=['float64'])
converted_float = df_float.apply(pd.to_numeric, downcast='float')
When we convert object columns to category, it's important to be aware of the trade-off: a column with mostly unique values can end up using more memory as a category, so we only convert columns where fewer than 50% of the values are unique:
gl_obj = gl.select_dtypes(include=['object']).copy()
converted_obj = pd.DataFrame()
for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:, col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:, col] = gl_obj[col]
Dates are parsed into the datetime64 type:
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
Finally, we capture the optimized dtypes and pass them to pandas.read_csv() so the file is loaded with the correct types from the start:
dtypes = optimized_gl.drop('date', axis=1).dtypes
dtypes_col = dtypes.index
dtypes_type = [i.name for i in dtypes.values]
column_types = dict(zip(dtypes_col, dtypes_type))
read_and_optimized = pd.read_csv('../data/game_logs.csv', dtype=column_types, parse_dates=['date'], infer_datetime_format=True)
In [9]:
import os
import pandas as pd
In [10]:
# Load Data
gl = pd.read_csv('../data/game_logs.csv')
# Also available at https://data.world/dataquest/mlb-game-logs
In [11]:
# Data Preview
gl.head()
Out[11]:
Original Article https://www.dataquest.io/blog/pandas-big-data/
In [14]:
gl.dtypes.head()
Out[14]:
In [37]:
# Select only the columns of a given type
gl.select_dtypes(include=['object']).head()
Out[37]:
In [16]:
# Exact amount of memory used by the dataframe
gl.info(memory_usage='deep')
In [18]:
gl.describe()
Out[18]:
In [24]:
# Reference http://www.markhneedham.com/blog/2017/07/05/pandas-find-rows-where-columnfield-is-null/
# Columns with null values
null_columns=gl.columns[gl.isnull().any()]
gl[null_columns].isnull().sum()
Out[24]:
In [26]:
# Every row that contains at least one null value
print(gl[gl.isnull().any(axis=1)][null_columns].head())
Under the hood, pandas groups the columns into blocks of values of the same type. Because each data type is stored separately, let's examine how many columns we have of each type and the average memory usage per data type:
In [62]:
gl.dtypes.value_counts()
Out[62]:
In [60]:
for dtype in gl.dtypes.unique():
    selected_dtype = gl.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print("Average memory usage for {} columns: {:03.2f} MB".format(dtype, mean_usage_mb))
Under the hood, pandas represents numeric values as NumPy ndarrays and stores them in a contiguous block of memory. This approach consumes less space and allows us to access the values quickly.
Many types in pandas have multiple subtypes that can use fewer bytes to represent each value. For example, the float type has the float16, float32, and float64 subtypes.
To discover the range of values a given dtype can represent, we can use numpy.iinfo: https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.iinfo.html
In [63]:
import numpy as np
int_types = ["uint8", "int8", "int16"]
for it in int_types:
    print(np.iinfo(it))
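The float subtypes mentioned above can be inspected in the same way with numpy.finfo; a small sketch:
# Range and precision of each float subtype
float_types = ["float16", "float32", "float64"]
for ft in float_types:
    print(np.finfo(ft))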
In [64]:
# We're going to be calculating memory usage a lot,
# so we'll create a function to save us some time!
def mem_usage(pandas_obj):
    if isinstance(pandas_obj, pd.DataFrame):
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:  # we assume that if it's not a dataframe, it's a series
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2  # convert bytes to megabytes
    return "{:03.2f} MB".format(usage_mb)
In [65]:
mem_usage(gl)
Out[65]:
In [68]:
gl_int = gl.select_dtypes(include=['int64'])
converted_int = gl_int.apply(pd.to_numeric,downcast='unsigned')
In [69]:
print(mem_usage(gl_int))
print(mem_usage(converted_int))
In [70]:
compare_ints = pd.concat([gl_int.dtypes,converted_int.dtypes],axis=1)
compare_ints.columns = ['before','after']
compare_ints.apply(pd.Series.value_counts)
Out[70]:
We can see a drop from 7.9 MB to 1.5 MB. We now have five uint8 columns and one uint32 column instead of six int64 columns.
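To make the downcasting behaviour concrete, here is a minimal sketch on two hypothetical toy Series; downcast picks the smallest subtype whose range still covers every value in the column:
s_small = pd.Series([0, 120, 250], dtype='int64')
print(pd.to_numeric(s_small, downcast='unsigned').dtype)  # uint8, since 250 fits in 0..255
s_large = pd.Series([0, 120, 70000], dtype='int64')
print(pd.to_numeric(s_large, downcast='unsigned').dtype)  # uint32, since 70000 exceeds the uint16 maximum of 65535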
Let's do the same thing with the float columns.
In [72]:
gl_float = gl.select_dtypes(include=['float'])
converted_float = gl_float.apply(pd.to_numeric,downcast='float')
print(mem_usage(gl_float))
print(mem_usage(converted_float))
compare_floats = pd.concat([gl_float.dtypes,converted_float.dtypes],axis=1)
compare_floats.columns = ['before','after']
compare_floats.apply(pd.Series.value_counts)
Out[72]:
All our float columns were converted from float64 to float32, giving us a 50% reduction in memory usage.
Now we apply these optimizations to the entire dataframe.
In [73]:
optimized_gl = gl.copy()
optimized_gl[converted_int.columns] = converted_int
optimized_gl[converted_float.columns] = converted_float
print(mem_usage(gl))
print(mem_usage(optimized_gl))
To get the biggest savings, however, we have to optimize the object columns.
The object type represents values using Python string objects, partly due to the lack of support for missing string values in NumPy. Because Python is a high-level, interpreted language, it doesn't give us fine-grained control over how values are stored in memory.
This limitation causes strings to be stored in a fragmented way that consumes more memory and is slower to access. Each element in an object column is really a pointer that contains the “address” of the actual value's location in memory.
Object columns therefore use a variable amount of memory. While each pointer takes up 8 bytes of memory (on a 64-bit system), each actual string value uses the same amount of memory that string would use if stored individually in Python. Let's use sys.getsizeof() to prove that out, first by looking at individual strings, and then at items in a pandas series.
In [74]:
from sys import getsizeof
s1 = 'working out'
s2 = 'memory usage for'
s3 = 'strings in python is fun!'
s4 = 'strings in python is fun!'
for s in [s1, s2, s3, s4]:
    print(getsizeof(s))
In [76]:
obj_series = pd.Series(['working out',
'memory usage for',
'strings in python is fun!',
'strings in python is fun!'])
obj_series.apply(getsizeof)
Out[76]:
You can see that the size of strings when stored in a pandas series is identical to their usage as separate strings in Python.
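Another way to see the pointer-versus-string split described above is to compare shallow and deep memory accounting on the same series (a minimal sketch):
# deep=False counts only the fixed-size object pointers (plus the index);
# deep=True also adds the size of each underlying Python string object.
print(obj_series.memory_usage(deep=False))
print(obj_series.memory_usage(deep=True))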
The category type uses integer values under the hood to represent the values in a column, rather than the raw values. Pandas uses a separate mapping dictionary that maps the integer values to the raw ones. This arrangement is useful whenever a column contains a limited set of values. When we convert a column to the category dtype, pandas uses the most space efficient int subtype that can represent all of the unique values in a column.
In [78]:
# Where might we be able to reduce memory usage?
gl_obj = gl.select_dtypes(include=['object']).copy()
gl_obj.describe()
Out[78]:
A quick glance reveals many columns where there are few unique values relative to the overall ~172,000 games in our data set. We start by converting day_of_week to category using the .astype() method.
In [79]:
dow = gl_obj.day_of_week
print(dow.head())
dow_cat = dow.astype('category')
print(dow_cat.head())
In [80]:
# We can see the integer codes associated with the column
dow_cat.head().cat.codes
Out[80]:
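The mapping between those integer codes and the original labels is exposed through the .cat.categories accessor; as a small sketch, we can reconstruct it as a dictionary:
# Code -> label mapping used internally by the category column
dict(enumerate(dow_cat.cat.categories))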
In [81]:
# We compare the memory usage
print(mem_usage(dow))
print(mem_usage(dow_cat))
When we convert columns to category, it's important to be aware of the trade-off: if most of a column's values are unique, a category column can actually use more memory than the original object column, since it still has to store every distinct string. We therefore only convert columns where fewer than 50% of the values are unique:
In [82]:
converted_obj = pd.DataFrame()
for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    if num_unique_values / num_total_values < 0.5:
        converted_obj.loc[:, col] = gl_obj[col].astype('category')
    else:
        converted_obj.loc[:, col] = gl_obj[col]
In [83]:
print(mem_usage(gl_obj))
print(mem_usage(converted_obj))
compare_obj = pd.concat([gl_obj.dtypes,converted_obj.dtypes],axis=1)
compare_obj.columns = ['before','after']
compare_obj.apply(pd.Series.value_counts)
Out[83]:
In [84]:
# Now we combine with the rest of our dataframe (numeric columns)
optimized_gl[converted_obj.columns] = converted_obj
mem_usage(optimized_gl)
Out[84]:
In [85]:
date = optimized_gl.date
print(mem_usage(date))
date.head()
Out[85]:
We'll convert using the pandas.to_datetime() function, using the format parameter to tell it that our dates are stored as YYYYMMDD.
In [86]:
optimized_gl['date'] = pd.to_datetime(date,format='%Y%m%d')
print(mem_usage(optimized_gl))
optimized_gl.date.head()
Out[86]:
We can specify the optimal column types when we read the data set in. The pandas.read_csv() function has a few different parameters that allow us to do this. The dtype parameter accepts a dictionary that has (string) column names as the keys and NumPy type objects as the values.
In [93]:
dtypes = optimized_gl.drop('date',axis=1).dtypes
dtypes.head()
Out[93]:
In [94]:
dtypes_col = dtypes.index
dtypes_col
Out[94]:
In [103]:
dtypes_type=[i.name for i in dtypes.values]
In [104]:
column_types = dict(zip(dtypes_col, dtypes_type))
In [117]:
#Preview of first 10
{k:v for k,v in list(column_types.items())[:10]}
Out[117]:
Now we can use the dictionary, along with a few parameters for parsing the date column, to read in the data with the correct types in a few lines:
In [119]:
read_and_optimized = pd.read_csv('../data/game_logs.csv', dtype=column_types, parse_dates=['date'], infer_datetime_format=True)
print(mem_usage(read_and_optimized))
read_and_optimized.head()
Out[119]:
In [136]:
import matplotlib.pyplot as plt
# Average game length (in minutes) per year
optimized_gl['year'] = optimized_gl.date.dt.year
game_lengths = optimized_gl.pivot_table(index='year', values='length_minutes')
game_lengths.reset_index().plot.scatter('year', 'length_minutes')
plt.show()